Can one predict DNA Transcription Start Sites by studying bubbles? 
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It has been speculated that bubble formation of several base-pairs due to thermal fluctuations 
is indicatory for biological active sites. Recent evidence, based on experiments and molecular dy- 
namics (MD) simulations using the Peyrard-Bishop-Dauxois model, seems to point in this direction. 
However, sufficiently large bubbles appear only seldom which makes an accurate calculation difficult 
even for minimal models. In this letter, we introduce a new method that is orders of magnitude 
faster than MD. Using this method we show that the present evidence is unsubstantiated. 

PACS numbers: 87.15.Aa,87.15.He,05.10.-a 



Double stranded DNA (dsDNA) is not a static entity. 
In solution, the bonds between bases on opposite strands 
can break even at room temperature. This can happen 
for entire regions of the dsDNA chain, which then form 
bubbles of several base-pairs (bp). These phenomena are 
important for biological processes such as replication and 
transcription. The local opening of the DNA double he- 
lix at the transcription start site (TSS) is a crucial step 
for the transcription of the genetic code. This opening is 
driven by proteins but the intrinsic fluctuations of DNA 
itself probably play an important role. The statistical 
and dynamical properties of these denaturation bubbles 
and their relation to biological functions have therefore 
been subject of many experimental and theoretical stud- 
ies. It is known that the denaturation process of finite 
DNA chains is not simply determined by the fraction 
of strong (GC) or weak (AT) base-pairs. The sequence 
specific order is important. Special sequences can have 
a high opening rate despite a high fraction of GC base 
pairs For supercoiled DNA, it has been suggested 
that these sequences are related to places known to be 
important for initiating and regulating transcription 0. 
For dsDNA, Choi et al found evidence that the formation 
of bubbles is directly related the transcription sites 
In particular, their results indicated that the TSS could 
be predicted on basis of the formation probabilities for 
bubbles of ten or more base-pairs in absence of proteins. 
Hence, the secret of the TSS is not in the protein that 
reads the code, but really a characteristics of DNA as ex- 
pressed by the statement: DNA directs its own transcrip- 
tion 0|. In that work, SI nuclease cleavage experiments 
were compared with molecular dynamics (MD) simula- 
tions on the Peyrard-Bishop-Dauxois (PBD) model 0,0 
of DNA. The method used is not without limitations. 
The SI nuclease cleavage is related to opening, but many 
other complicated factors are involved. Moreover, theo- 
retical and computational studies have to rely on simpli- 
fied models and considerable computational power. As 
the formation of large bubbles occurs only seldom in a 
microscopic system, MD or Monte Carlo (MC) methods 



suffer from demanding computational efforts to obtain 
sufficient accuracy. Nevertheless, the probability profile 
found for bubbles of ten and higher showed a striking 
correlation with the experimental results yielding pro- 
nounced peaks at the TSS Q. Still, the large statisti- 
cal uncertainties make this correlation questionable. To 
make the assessment absolute, we would either need ex- 
tensively long or exceedingly many simulation runs or a 
different method that is significantly faster than MD. 

In this letter, we introduce such a method for the cal- 
culation of bubble statistics for first neighbor interac- 
tion models like the PBD. We applied it to the sequences 
studied in Refs. 3] and, to validate the method and to 
compare its efficiency, we repeated the MD simulations 
with 100 times longer runs. The new method shows re- 
sults consistent with MD but with a lot higher accuracy 
than these considerably longer simulations. Armed with 
this novel method, we make a full analysis of preferential 
opening sites for bubbles of any length. This analysis 
shows that there is no strict analogy between these pref- 
erential sites and the TSS using equilibrium statistics. 
Hence, the previously found correlation must have been 
either accidental or due to some non-equilibrium effect, 
which remains speculative. We discuss this issue and, 
more generally, the required theoretical and experimen- 
tal advancements that could address the title's question 
definitely. 

The PBD model reduces the myriad degrees of freedom 
of DNA to an one-dimensional chain of effective atom 
compounds describing the relative base-pair separations 
Ui from the ground state positions. The total potential 
energy U for an N base-pair DNA chain is then given by 
U(y N ) = V 1 (y 1 ) + £^2 + W(y h w _ x ) with y N = 

{yi} the set of relative base pair positions and 

Vi{vi) = A(e- Q ^-l) 2 (1) 
W(vi, yi -i) = ^(l + pe-^+y-^^-y^f 
The first term Vi is the on site Morse potential describing 



2 



the hydrogen bond interaction between bases on oppo- 
site strands. Di and at determine the depth and width 
of the Morse potential and arc different for the AT and 
GC base-pair. The stacking potential W consists of a 
harmonic and a nonlinear term. The second term was 
later introduced [j| and mimics the effect of decreasing 
overlap between n electrons when one of two neighbor- 
ing base move out of stack. As a result, the effective 
coupling constant of the stacking interaction drops from 
K' = K (1 + p) down to K' = K . It is due to this term 
that the observed sharp phase transition in denaturation 
experiments can be reproduced. All interactions with 
the solvent and the ions are effectively included in the 
force-field. The constants K , p, a, Dat, -Dgc> oat, Qgc 
were parameterized in Ref. Q and tested on denatura- 
tion curves of short heterogeneous DNA segments. These 
examples show that, despite its simplified character, the 
model is able to give a quantitative description of DNA. 
Most importantly, it allows to study the statistical and 
dynamical behavior of very long heterogeneous DNA se- 
quences, which is impossible for any atomistic model. 

Despite these successes, it is important to realize the 
limitations of the model. The PBD model treats the 
A and T bases and the G and C bases as identical ob- 
jects. The stacking interaction is also independent of the 
nature of the bases. Moreover, there is a subtle point 
that needs further explanation. As the PBD model ba- 
sically represents a single dsDNA in an infinite solution, 
the probability for complete denaturation of a molecule 
of finite length, resulting in two single stranded DNAs, 
tends to unity with increasing time at any temperature. 
In the experiments, where the amount of solvated DNA 
is not infinitely diluted, this effect is counterbalanced by 
the recombination mechanism where two single stranded 
chains in solution come together and match their comple- 
mentary bases. Hence, in our calculations we will restrict 
the configurational space to the dsDNA only, first of all 
because it is a very good approximation in comparison to 
experiments which are not performed in the immediate 
vicinity of the denaturation transition and, secondly, be- 
cause it is a necessary condition to give a relevant mean- 
ing to the ensemble averages calculated within the PBD 
model. 

In microscopic terms, a configuration y N is called a 
dsDNA molecule when y\ < yo for at least one i € 
[1 : N] with j/o the opening threshold definition. Sim- 
ilarly, a configuration is completely denaturated when- 
ever yi > y for all i. The statistical average (A(y N )) 
is equivalent to the ratio of two A-dimensional inte- 
grals (A) = Jdy N A{y N )g(y N )/ fdy N g(y N ) with dy N = 
dywdyN-i ■ ■ ■ dyi and g the probability distribution den- 
sity. Numerical integration calculates these integrals ex- 
plicitly, while MD and MC calculates only the ratio. Usu- 
ally, the dimensionality of the system prohibits direct 
numerical integration making MD and MC far favorable. 
However, an increase of the computational efforts by a 



factor of two reduces the error by only a factor of \/2 in 
MD and MC, while the reduction can be quite dramatic 
in low dimensional systems using numerical integration. 
In the following, we show how to exploit this by cre- 
ating an effective reduction of the dimensions yielding 
an orders-of-magnitude faster algorithm for the bubble 
statistics calculation. To explain the algorithm, we need 
to define a set of functions 



i(vi) = 0(yi-yo), Qi(yi) = %o - y%) 
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where #(•) equals the Heaviside step function. 8i equals 
I if the base-pair is open and is zero otherwise. 9i is the 
reverse. These functions indicate whether a base-pair is 
open or closed. Using these, we define 
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JJ 6j for m odd (3) 
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which are 1 (0 otherwise) if and only if i is at the center of 
a bubble that has exactly size m. To shorten the notation 
we have dropped the yi dependencies. For even numbers 
it is a bit arbitrary where to place the center, but we 
defined it as the base directly to the left of the midpoint 
of the bubble. In order to have these quantities defined 
also near the ends of the chain, we use (9, = 1 for i = and 
i = N +1. The properties of interest are the probabilities 
for bubbles of size m centered at base-pair i provided that 
the molecule is in the double stranded configuration. 
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with fi = 1 — J^J ( 
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Here fj, = 1 except when all bases are open; then \x = 0. 
The partition function integrals are given by: 



Z= dy N e-^ N \ Z dm] - / d/e-^fe"'^ 1 
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Note that both Z as Z\\ are infinite, but their differ- 
ence is well defined. Now, we can make use of the 
fact that all integrals Zx are of the factorizable form 
Ex = Jdy N a i x\y N ,y N - 1 ) . . . (y 3 ,y 2 )a ( x [) (y 2 ,yi) us- 
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ing following iterative scheme 
z { x{V2) = I dy 1 a x (y2,yi) 
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The calculation of z x (yi) for a discrete set of n gr id val- 

ii 



(6) 
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FIG. 1: (color). The probability of bubble opening as function 
of bubble size and position for the AAVP5 promoter and the 
mutant sequence at 300 K. Probabilities in each row are nor- 
malized by a different factor <f>(m) = MAX[<^ m] ^> ] for i € 

[1, N] given in the lower panel. The 69 bp sequences start at 
index -46 and end at +23. The TSS is at +1, the mutation is 
at (+1, +2) were (A,T) bases are replaced by (G,C). Contrary 
to [3, the mutation effect is very local. 
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requires only n^ rld function evaluations whenever 
is known. Hence, a total of N ■ n^ rid function eval- 
uations are required instead of n^ rid which is a huge im- 
provement. Further increase can be obtained by intro- 
ducing proper cut-offs for the numerical integration. We 
use integration boundaries such that for alH: L < yi < R 
and \yi — y;-i| < d, which we control by a single input 
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parameter e 

and R = jjq + V Nd. Any configuration outside this range 
but with at least one base-pair closed will have a proba- 
bility density smaller than e/ (Z—Zu)- A strong decrease 



in the parameter e will only marginally increase the in- 
tegration boundaries. We took e = 10 -40 that is much 
smaller than necessary for our accuracy. After storing 
the following function values in matrices M 



(AT/GC) __ 
ij 



exp(-/3[^AT/Gc(£ + i&y) + W(L + zAy, L+(i + j)Ay)]) 
with < i < INT[(i? - L) /Ay] and -INT[d/Ay] < 
j < INT [dj Ay] we can reduce the integral operations 
for Eq. © (using Simpson's rule) into inexpensive mul- 
tiplication and addition operations only. 

As a first investigation, we applied this new method 
on the adeno-associated viral P5 promoter and the mu- 
tant from Refs. using y Q — 1.5 as opening thresh- 
old which corresponds to 2.1 A in real units. To make 
the comparison with MD which uses periodic boundary 
conditions (PBC), we replicated the chain at both ends, 
but only computed the statistics for the middle chain. 
This approach, is cheaper than true PBC which scales 

as N ■ (n gr id) 3 . The full probability matrix (of n ^ was 

calculated for the middle sequence up to bubbles of size 
m = 50. A fraction of this matrix is presented in Fig. 
in a color plot. In agreement with Ref. we find pref- 
erential opening probabilities at the TSS site at +1 that 
vanishes after the mutation. But contrary to the results 
of Ref. 0, we find that the TSS is not at all the most 
dominant opening site. Stronger opening sensitivity is 
found at the -30 region. Moreover at variance with the 
previous established findings, Fig-.^shows that the muta- 
tion effect is very local. In Fig.[21we make a projection by 
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FIG. 2: (color). The probabilities for bubbles larger than 
10 bp for the AAVP5 promoter and the mutant at 300 K. 
Both semi-PBC (three-fold replicated system) and loose ends 
(single chain) are compared and two values for the opening 
threshold yo = 1.0 and yo = 1.5. MD results (black) for 
j/o = 1-5 with PBC are also given with corresponding error- 
bars. A change of scale in the y axis is applied to include 
the higher openings at the free boundaries. All results agree 
but are different from the less accurate results of 3], The 
mutation and the free boundaries only have a local impact on 
the bubble statistics. 



looking at the probability Pj 
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that at 



site i one can find a bubble of size 10 or larger. We com- 
pared different boundary conditions and two values for 
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Uq. In addition, we made the comparison with MD |9j 
by performing 100 simulations of 100 ns with different 
friction constants 7 in the Langevin MD and 10 simu- 
lations of 1 fj,s using Nose-Hoover. The curves matched 
within the statistical errors and agreed with the integra- 
tion method (see for instance Fig. [21 where the Langevin 
7 = 10 results are plotted together with the results of the 
integration method). 

We obtained relative errors around 10 % for Nose- 
Hoover and Langevin with 7 = 10 and 5 ps _1 . The 
errors of the 7 = 0.05 ps -1 , used in Ref. 0, were con- 
siderably larger due a stronger correlation between suc- 
cessive timesteps. The results of 0] were based on 100 
times fewer statistics. Hence, the corresponding errors in 
must have been 10 times larger which can explain the 
variance with our results. Another explanation could be 
that the results of are due to some out-of-equilibrium 
or dynamical effects. Such effects depend strongly on the 
choice of initial conditions, which poses the problem of 
defining biologically significant initial conditions and de- 
termining, in a meaningful way, the relevant time scale 
along which the simulations have to be carried to detect 
such non-equilibrium phenomena. 

The principal error in the new method is mainly due 
to the finite integration steps. To estimate the accuracy, 
we compared Ay = 0.1 and 0.05 with the almost exact 
results of Ay = 0.025. Using the TSS peak of the AAVP5 
sequence with free boundaries as reference, we found that 
the systematic error drops from ~ 5 % to 0.03 % for CPU 
times of 40 minutes and 3 hours only. For comparison, 
the last accuracy would take about 200 years with MD 
on the same machine. The evaluation of larger bubbles 
becomes increasingly more difficult for MD. Bubbles of 
size 20 showed statistical errors > 100 % while these were 
only slightly increased for the integration method. It is 
interesting to note that the 10 bp size is more or less the 
upper limit for which one get sufficient accuracy using 
MD, while it is a lower limit were its relation to bio- 
physics becomes interesting stressing the importance 
of our method. Finally, we calculated the Pi probabilities 
for the adenovirus major late promoter (AdMLP) and a 
control non promoter sequence (Fig. |3J. Also here, our 
results violate the TSS conjecture. The TSS shows some 
opening, but cannot be assigned on basis of bubble pro- 
file only. Surprisingly, even the control sequence shows 
significant opening probabilities. 

To conclude, we have shown that MD (or MC) encoun- 
ters difficulties to give a precise indication of preferential 
opening sites. In particular, information of large bub- 
bles is not easily accessible using standard methods. The 
method presented here is orders of magnitude faster than 
MD without imposing additional approximations. Using 
this method, we showed that the TSS is generally not the 
most dominant opening site for bubble formation. These 
results contradict foregoing conjectures based on less ac- 
curate simulation techniques. However, to address the 
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FIG. 3: (color). Same as Fig.^for the 86 bp AdMLP and the 
63 bp non-promoter control sequences. The biological non- 
active control sequence shows considerable opening probabil- 
ity, even more than the biological active AdMLP promoter. 



title's question, definitely, there are still many issues to 
be solved. Still, there is some chance that bubble dynam- 
ics rather than bubble statics is indicatory for the TSS. 
Speculatively, the previously found correlation could be 
justified using this argument. However, a statistical sig- 
nificant foundation for this is lacking and it is highly 
questionable whether the PBD model and this type of 
Langevin dynamics can give a sufficiently accurate de- 
scription for the dynamics of DNA. The PBD model 
could and, probably, should be improved to give a correct 
representation of the subtile sequence specific properties 
of DNA. Base specific stacking interaction seems to give 
better agreement with some direct experimental obser- 
vations 8J. Also, the development of new experimental 
techniques is highly desirable. Our method is not lim- 
ited to the PBD model or to bubble statistics only, but 
it works whenever the proper factorization © can be 
applied. Therefore, we believe that the technique pre- 
sented here will remain of importance for the future in- 
vestigations of bubbles in DNA and their biological con- 
sequences. 
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